DETAILED ACTION
Continued Examination Under 37 CFR 1.114 

1. A request for continued examination under 37 CFR 1.114, including the fee set
forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this 
application is eligible for continued examination under 37 CFR 1.114, and the fee set 
forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action 
has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on October 
25, 2005 has been entered. 

Response to Arguments 

2. Applicant's arguments filed October 25, 2005 have been fully considered but they 
are not persuasive. 

The newly added limitations of independent claims 1 and 18 require a circular 
buffer for "continuously receiving and maintaining the last few seconds of the acoustic 
response" that is supplied to an audio input device (emphasis added). The Applicant 
has alleged that the buffering system/method of Basu et al. does not meet this 
limitation. 

However, Basu et al. disclose that the microphone (audio input device) is turned 
on when a face identified as frontal is detected (column 15, lines 37-41). Prior to this 
point, the microphone is off and thus there is no acoustic response being supplied to the
audio input device. As soon as the microphone is turned on, and an acoustic response
is generated, any signal received from the microphone is stored in the buffer (column
15, lines 41-44). Thus, while Basu et al. do not continuously leave the microphone on
and store data in the buffer, the buffer continuously receives the acoustic response, 
since there is no acoustic response until the microphone is turned on, and as soon as 
the microphone is turned on, the buffer is activated. 

Regarding the limitation that the buffer maintains the "last few seconds" of the 
response, it is first noted that "last few seconds" is a vague and indefinite term. For this 
reason, claims 1 and 18 are rejected below under 35 U.S.C. 112, second paragraph. 
Additionally, Basu et al. disclose the buffered data is periodically removed from the 
buffer as the visual speech patterns are analyzed to determine the presence of speech. 
If the buffered data is tagged as speech it is passed on to the speech recognizer (and 
out of the buffer, column 15, lines 48-53). While Basu et al. do not explicitly state what 
happens when the acoustic response stored in the buffer is not tagged as speech, the 
data must be flushed out of the buffer, since, by definition, a buffer is a temporary 
storage space. Furthermore, given any reasonable reading of Basu et al., it is clear that 
the buffer is used to delay the processing of audio data until the associated video data 
can be processed. When implemented as a computer processing system "known to
those in the art" (see column 19, lines 6-21), this need for a delay would not be more
than "a few seconds".

Therefore, the rejection of claim 18 is maintained. 

Claim 1 has the additional requirement that the buffer be a "circular" buffer (note 
that in previous claims and in the specification, the synonymous term "ring buffer" is 
used). However, as admitted by the Applicant (see response to arguments in Office 
Action with mailing date July 25, 2005), ring or circular buffers are well known in the art,
are easy to implement, and ensure the buffer stays at a fixed size. Therefore, modifying 
the general buffer disclosed by Basu et al. to be a circular buffer would have been 
obvious to one of ordinary skill in the art at the time of invention. 
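
For illustration only, a fixed-size circular buffer of the kind referred to above can be sketched as follows. The class name, the sample rate, and the window length are assumptions made for this sketch; none of them is taken from Basu et al. or from the Applicant's specification.

```python
class CircularAudioBuffer:
    """Fixed-capacity ring buffer that keeps only the most recent samples."""

    def __init__(self, sample_rate_hz, seconds):
        # Capacity is fixed at construction, so the buffer never grows.
        self.capacity = int(sample_rate_hz * seconds)
        self._data = [0] * self.capacity
        self._write = 0   # index of the next slot to overwrite
        self._count = 0   # number of valid samples currently stored

    def push(self, sample):
        # Once the buffer is full, the oldest sample is overwritten.
        self._data[self._write] = sample
        self._write = (self._write + 1) % self.capacity
        self._count = min(self._count + 1, self.capacity)

    def snapshot(self):
        # Return the stored samples ordered from oldest to newest.
        if self._count < self.capacity:
            return self._data[:self._count]
        return self._data[self._write:] + self._data[:self._write]


# Hypothetical usage: 4 samples per second, 2-second window (8 samples).
buf = CircularAudioBuffer(sample_rate_hz=4, seconds=2)
for s in range(10):
    buf.push(s)
print(buf.snapshot())
```

Because the write index wraps around, the storage stays at a fixed size while always holding the most recently received samples.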

Claim Objections 

3. Claims 1 and 18 are objected to because of the following informalities: 
In line 14 of claim 1, after "maintaining" insert --the--.
In line 5 of claim 18, after "maintaining" insert --the--.

Appropriate correction is required. 

Claim Rejections - 35 USC § 112

4. The following is a quotation of the second paragraph of 35 U.S.C. 112: 

The specification shall conclude with one or more claims particularly pointing out and distinctly 
claiming the subject matter which the applicant regards as his invention. 

5. Claims 1 and 18 are rejected under 35 U.S.C. 112, second paragraph, as being 
indefinite for failing to particularly point out and distinctly claim the subject matter which 
applicant regards as the invention. 

As discussed above, the use of the term "last few seconds" is a relative term that 
renders the claims vague and indefinite. There is no description in the claims or 
specification to indicate what a "few" seconds would be (e.g. one or two, less than five, 
tens of seconds, hundreds of seconds, etc.).

Claim Rejections - 35 USC § 102

6. The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that 
form the basis for the rejections under this section made in this Office action: 

A person shall be entitled to a patent unless - 

(e) the invention was described in (1) an application for patent, published under section 122(b), by 
another filed in the United States before the invention by the applicant for patent or (2) a patent 
granted on an application for patent by another filed in the United States before the invention by the 
applicant for patent, except that an international application filed under the treaty defined in section 
351(a) shall have the effects for purposes of this subsection of an application filed in the United States 
only if the international application designated the United States and was published under Article 21(2)
of such treaty in the English language. 

7. Claims 18-24 are rejected under 35 U.S.C. 102(e) as being anticipated by Basu 
et al. (U.S. Patent 6,594,629). 

In regard to claim 18, Basu et al. disclose a method of recognizing speech 
utterances of a speaker with an automatic speech recognizer responsive to acoustic 
speech utterances of the speaker comprising: 

detecting acoustic energy having a spectrum associated with speech utterances 
(event detection module 28 performing audio event detection, column 15, line 66 to 
column 16, line 4), 

continuously receiving and maintaining the last few seconds of acoustic energy 
(column 15, lines 37-56, see explanation above in Response to Arguments section), 

detecting at least one facial characteristic associated with speech utterances of 
the speaker (event detection module performing mouth opening detection, column 15, 
lines 41-42), and 

activating the automatic speech recognizer in response to the detected acoustic 
energy having a spectrum associated with speech utterances while the at least one 
facial characteristic associated with speech utterances of the speaker is occurring
(information from the audio event detection and the facial feature detection are used by 
event detection module 28 to turn search module 34 on or off, column 15, lines 19-22 
and lines 59-65, column 16, lines 53-56; the event detection module 28 inherently only 
indicates the user is making speech if both the audio and visual detectors detect speech 
events). 

In regard to claim 19, Basu et al. disclose search module 34 is turned on when a 
speech event is detected (column 15, lines 59-61). The event detection is a 
combination of the detection of a visual speech event (mouth opening) as well as the 
detection of an audio event (column 16, lines 53-56). The method disclosed by Basu et 
al. therefore comprises preventing activation of the automatic speech recognizer in 
response to any of: 

(a) no acoustic energy having a spectrum associated with speech utterances 
being detected while no facial characteristic associated with speech utterances of the 
speaker is detected, 

(b) acoustic energy having a spectrum associated with speech utterances being 
detected while no facial characteristic associated with speech utterances of the speaker 
is detected, and 

(c) no acoustic energy having a spectrum associated with speech utterances 
being detected while at least one facial characteristic associated with speech utterances 
of the speaker is detected.

In regard to claim 20, Basu et al. disclose assuring that the beginning of each
speech utterance is coupled to the speech recognizer (with the buffer, column 15, lines 
41-42 and lines 50-53). 

In regard to claim 21, Basu et al. disclose the beginning of each speech
utterance is assuredly coupled to the speech recognizer by: 

(a) delaying the speech utterance (in the buffer, column 15, lines 41-42), 

(b) recognizing the beginning of each speech utterance (with event detection 
module 28, using facial features and audio features, column 15, lines 41-42, column 15, 
line 66 to column 16, line 4, and column 16, lines 53-56), and 

(c) responding to the recognized beginning of each speech utterance to couple 
the delayed speech utterance associated with the beginning of each speech utterance 
to the speech recognizer and thereafter sequentially coupling the remaining delayed 
speech utterances to the speech recognizer (column 15, lines 50-53). 
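
For illustration only, steps (a) through (c) can be sketched as a short delay buffer whose contents are released to the recognizer once the beginning of an utterance is recognized. The function name, the frame labels, the `is_speech` callback, and the `delay_frames` parameter are hypothetical and are not taken from Basu et al.

```python
from collections import deque


def couple_utterance(frames, is_speech, delay_frames=2):
    """Yield the frames that would be coupled to the recognizer."""
    delayed = deque(maxlen=delay_frames)  # (a) holds the most recent frames
    active = False
    for frame in frames:
        if is_speech(frame):
            if not active:
                # (b) beginning of an utterance recognized
                active = True
                # (c) couple the delayed frames first, oldest to newest
                while delayed:
                    yield delayed.popleft()
            # then couple the remaining frames sequentially
            yield frame
        else:
            # No speech event: nothing is coupled; the frame is only delayed.
            active = False
            delayed.append(frame)


# Hypothetical frame labels: "n" = no speech event, "S" = speech event.
frames = ["n1", "n2", "n3", "n4", "S1", "S2", "n5", "S3"]
out = list(couple_utterance(frames, lambda f: f.startswith("S")))
print(out)
```

In this sketch the delayed frames preceding the recognized onset are coupled first, so audio received just before the detector fires is not lost.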

In regard to claim 22, Basu et al. disclose assuring that no detected acoustic
energy is coupled to the speech recognizer upon the completion of an utterance
(recognition is performed for each piece of buffered data, until no speech is uttered,
column 15, lines 53-55).

In regard to claim 23, Basu et al. disclose assurance that no detected acoustic
energy is coupled to the speech recognizer upon the completion of a speech utterance 
is provided by: 

(a) delaying the acoustic energy associated with the speech utterance (in the 
buffer, column 15, lines 41-42), 

(b) recognizing the completion of each speech utterance (no more speech event, 
column 15, lines 53-55 and 59-61), and 

(c) responding to the recognized completion of each speech utterance to 
decouple delayed acoustic energy occurring after the completion of each speech 
utterance from the speech recognizer (the process is only repeated until no more 
speech events are detected, then search module 34 is turned off, column 15, lines
53-55 and 59-61).

In regard to claim 24, Basu et al. disclose the at least one facial characteristic 
indicates the face of the speaker has a predetermined orientation relative to a detector 
involved in the step of detecting the at least one facial characteristic (a face is identified 
as 'frontal' facing, column 15, lines 37-38). 

Claim Rejections - 35 USC § 103 

8. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set 
forth in section 102 of this title, if the differences between the subject matter sought to be patented and 
the prior art are such that the subject matter as a whole would have been obvious at the time the 
invention was made to a person having ordinary skill in the art to which said subject matter pertains.
Patentability shall not be negatived by the manner in which the invention was made. 

9. Claims 1, 2, 6, 7, 9, 12, and 13 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Basu et al., in view of the Applicant's Admitted Prior Art. 

In regard to claim 1, Basu et al. disclose a speech recognition system (Fig. 1)
comprising: 

an acoustic detector for detecting speech utterances of a speaker using an audio 
input device (microphone; event detection module 28 performing audio event detection, 
column 15, line 66 to column 16, line 4); 

a visual detector for detecting at least one facial characteristic associated with 
speech utterances of the speaker (event detection module performing mouth opening 
detection, column 15, lines 41-42); 

a processing arrangement connected to be responsive to the acoustic and visual 
detectors for deriving a signal having first and second values respectively indicative of 
the speaker making and not making speech utterances such that the first value is 
derived in response to the acoustic detector detecting a finite, nonzero acoustic 
response while the visual detector detects at least one facial characteristic associated 
with speech utterances of the speaker (information from the audio event detection and 
the facial feature detection are used by event detection module 28 to turn search 
module 34 on or off, column 15, lines 19-22 and lines 59-65, column 16, lines 53-56; the 
event detection module 28 inherently only indicates the user is making speech if both 
the audio and visual detectors detect speech events);

said processing arrangement comprising a buffer for continuously receiving and
maintaining the last few seconds of the acoustic response (column 15, lines 37-56, see 
explanation above in Response to Arguments section), and 

a speech recognizer for deriving an output indicative of the speech utterances as 
detected by the acoustic detector, the speech recognizer being connected to be 
responsive to the acoustic detector while the signal has the first value (search module 
34 is turned on when a speech event is detected by event detection module 28, column 
15, lines 59-61). 

Basu et al. do not disclose that the buffer is a circular buffer.

The Applicant's admitted prior art discloses the use of circular buffers when buffering
data because they are easy to implement and ensure the buffer stays at a fixed size. 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu et al. so that the buffer was a circular buffer because circular 
buffers are easy to implement and ensure the buffer stays at a fixed size.

In regard to claim 2, Basu et al. disclose search module 34 is turned on when a 
speech event is detected (column 15, lines 59-61). The event detection is a 
combination of the detection of a visual speech event (mouth opening) as well as the 
detection of an audio event (column 16, lines 53-56). The signal output by search 
module 34, therefore, is the second value (off) in response to any of: 

(a) the acoustic detector not detecting a finite, nonzero acoustic response while
the visual detector does not detect speech utterances of the speaker,

(b) the acoustic detector detecting a finite, nonzero acoustic response while the
visual detector does not detect speech utterances of the speaker, and 

(c) the acoustic detector not detecting a finite, nonzero acoustic response while 
the visual detector detects speech utterances of the speaker. 
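
For illustration only, the relationship between the two detector outputs and the two signal values can be written as a logical AND. The function name and the Boolean encoding of the first value (True) and second value (False) are assumptions made for this sketch.

```python
def speech_signal(acoustic_detected: bool, visual_detected: bool) -> bool:
    # First value (True) only when both detectors report a speech event;
    # second value (False) in cases (a), (b), and (c).
    return acoustic_detected and visual_detected


# Enumerate all four detector combinations.
for acoustic in (False, True):
    for visual in (False, True):
        print(acoustic, visual, speech_signal(acoustic, visual))
```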

In regard to claim 6, Basu et al. disclose the buffer assures that the beginning of 
each speech utterance is coupled to the speech recognizer (speech is collected in the 
buffer, so that the speech can be sent for recognition if it is determined to be speech, 
column 15, lines 41-42 and lines 50-53). 

Basu et al. do not disclose that the buffer is a circular buffer.

The Applicant's admitted prior art discloses the use of circular buffers when buffering
data because they are easy to implement and ensure the buffer stays at a fixed size. 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu et al. so that the buffer was a circular buffer because circular 
buffers are easy to implement and ensure the buffer stays at a fixed size. 

In regard to claim 7, Basu et al. disclose the buffer is connected to be
responsive to the acoustic detector (the buffer collects speech data), the buffer including 
a plurality of stages for storing sequential segments of the output of the acoustic 
detector, the delay arrangement being such that the contents of the memory element 
stage storing the beginning of a speech utterance are initially coupled to the speech 
recognizer (speech is collected in the buffer, so that the speech can be sent for
recognition if it is determined to be speech, column 15, lines 41-42 and lines 50-53). 

A buffer inherently collects sequential elements in a plurality of stages to ensure 
that the beginning of the elements are initially passed to the next stage. The buffer 
disclosed by Basu et al., therefore, inherently includes a plurality of stages for storing 
sequential segments of the output of the acoustic detector, the delay arrangement being 
such that the contents of the memory element stage storing the beginning of a speech 
utterance are initially coupled to the speech recognizer. 

Basu et al. do not disclose that the buffer is a circular buffer.

The Applicant's admitted prior art discloses the use of circular buffers when buffering
data because they are easy to implement and ensure the buffer stays at a fixed size. 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu et al. so that the buffer was a circular buffer because circular 
buffers are easy to implement and ensure the buffer stays at a fixed size. 

In regard to claim 9, Basu et al. disclose the delay arrangement (buffer) is 
arranged for assuring that upon the completion of each speech utterance the acoustic 
detector is decoupled from the speech recognizer (recognition is performed for each 
piece of buffered data, until no speech is uttered, column 15, lines 53-55). 

In regard to claims 12 and 13, Basu et al. disclose the processing arrangement 
includes a face recognizer arranged for enabling the signal to have the first value in 
response to the speaker being at a predetermined orientation relative to the visual
detector connected to be responsive to the visual detector (a face is identified as 
'frontal' facing by frontal pose detector 20, column 7, lines 32-34 and column 15, lines 
37-38). 

10. Claims 14-17 and 25-27 are rejected under 35 U.S.C. 103(a) as being 
unpatentable over Basu et al. (U.S. Patent 6,594,629, hereinafter Basu 1), in view of 
Basu et al. (U.S. Patent 6,219,640, hereinafter Basu 2). 

In regard to claim 14, Basu 1 discloses the system has applications in speaker 
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not explicitly disclose distinguishing the faces of a plurality of
speakers and enabling the signal to have the first value in response to a speaker
having a recognized face.

Basu 2 discloses a system for identifying a speaker that includes a face 
recognizer (Fig. 1, face recognition 24) that distinguishes from a plurality of speakers
(identifies the person speaking, column 6, lines 46-50). 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to identify who was speaking using a face recognizer and 
enable the speech recognizer in response to a speaker having a recognized face,
in order to get accurate recognition results when the user was in a crowd.

In regard to claim 15, Basu 1 discloses the system has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not explicitly disclose including a speaker identity recognizer to be 
responsive to the acoustic detector, to distinguish speech patterns of a plurality of 
speakers and enable the signal to have the first value in response to the speaker having 
a recognized speech pattern. 

Basu 2 discloses a speaker identity recognizer (speaker recognizer 16) arranged
for distinguishing speech patterns of a plurality of speakers (column 5, lines 29-34). 

It would have been obvious to one of ordinary skill in the art at the time of
invention to modify Basu 1 to identify who was speaking using a speaker recognizer
and enable the speech recognizer in response to the speaker's voice being recognized, in
order to provide a second means for confirming the identity of the speaker and provide a
backup if the facial recognizer had difficulty identifying the user.

In regard to claims 16 and 17, Basu 1 discloses the system has applications in 
speaker detection in an audience, as well as speaker recognition (identifying who is 
speaking), and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not disclose the processing arrangement is arranged for causing the 
signal to have the first value in response to the speaker having a recognized face 
matched with a recognized speech pattern of the same speaker.

Basu 2 discloses a speaker is identified when a face is matched with a
recognized speech pattern of the same speaker (joint identification module 30, column 
8, lines 43-47). 

It would have been obvious to one of ordinary skill in the art at the time of
invention to modify Basu 1 to enable the speech recognizer when the recognized face
matched the recognized speech pattern of the same speaker, in order to improve the
speaker recognition accuracy, as taught by Basu 2 (column 3, lines 32-36).

In regard to claims 25 and 26, Basu 1 discloses the method has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not explicitly disclose distinguishing the faces of a plurality of 
speakers and enabling the speaker to have the first value in response to a speaker 
having a recognized face. 

Basu 2 discloses a method for identifying a speaker that includes a face 
recognizer (Fig. 1, face recognition 24) that distinguishes from a plurality of speakers
(identifies the person speaking, column 6, lines 46-50). 

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to identify who was speaking using a face recognizer and 
enable the speech recognizer in response to a speaker having a recognized face,
in order to get accurate recognition results when the user was in a crowd.

In regard to claim 27, Basu 1 discloses the method has applications in speaker
detection in an audience, as well as speaker recognition (identifying who is speaking), 
and refers to the Basu 2 reference (column 17, lines 11-15). 

Basu 1 does not disclose storing images of the faces and speech patterns during 
a training period and comparing the stored images and speech patterns with images of 
the face of the speaker and the speech pattern of the speaker. 

Basu 2 discloses storing: 

(1) images of the faces of a plurality of speakers (column 7, lines 27-28), and

(2) the speech patterns of the same plurality of speakers during at least one 
training period (column 6, lines 17-20, see also column 14, lines 10-12); and 

performing the distinguishing steps by comparing the stored images and speech 
patterns with images of the face of the speaker and the speech pattern of the speaker 
(joint identification, column 8, lines 43-47).

It would have been obvious to one of ordinary skill in the art at the time of 
invention to modify Basu 1 to store images of the faces of a plurality of speakers and 
the speech patterns of the same plurality of speakers during a training period, since 
these are necessary to later identify the users. Furthermore, it also would have been 
obvious to one of ordinary skill in the art at the time of invention to enable the speech 
recognizer when the recognized face matched the recognized speech pattern of the 
same speaker, in order to improve the speaker recognition accuracy, as taught by Basu 
2 (column 3, lines 32-36).

Conclusion


11. Any inquiry concerning this communication or earlier communications from the
examiner should be directed to Brian L. Albertalli whose telephone number is (571)
272-7616. The examiner can normally be reached on Mon - Fri, 8:00 AM - 5:30 PM, every
second Fri off. 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Wayne Young, can be reached on (571) 272-7582. The fax phone number
for the organization where this application or proceeding is assigned is 571-273-8300. 

Information regarding the status of an application may be obtained from the 
Patent Application Information Retrieval (PAIR) system. Status information for 
published applications may be obtained from either Private PAIR or Public PAIR. 
Status information for unpublished applications is available through Private PAIR only. 
For more information about the PAIR system, see http://pair-direct.uspto.gov. Should 
you have questions on access to the Private PAIR system, contact the Electronic 
Business Center (EBC) at 866-217-9197 (toll-free).



W. R. Young
PRIMARY EXAMINER


